11 research outputs found

    Opinion Spam detection using PU-Learning

    Doctoral thesis by Donato Hernández Fusilier at the Universitat Politècnica de València, supervised by Dr. Paolo Rosso (Universitat Politècnica de València, Spain), Dr. Manuel Montes-y-Gómez (Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico) and Dr. Rafael Guzmán (Universidad de Guanajuato, Mexico). The defense took place on January 20, 2016 in Valencia. The doctoral committee consisted of Dr. Raquel Martinez Unanue (Universidad Nacional de Educación a Distancia, Madrid) as panel member, Dr. Carlos David Martinez Hinajeros (Universitat Politècnica de València) as secretary, and Dr. Rafael Berlanga Llavori (Universitat Jaume I, Castelló) as president. The thesis was graded Sobresaliente (Outstanding).

    Detección de opinion spam usando PU-learning

    Thesis by compendium. [EN] Abstract: Detecting whether opinions about a product or service are false or truthful has become a very important problem. Recent studies show that up to 80% of people have changed their final decision on the basis of opinions read on the web. Some of these opinions may be false: positive ones written to promote a product or service, or negative ones written to discredit it. To help solve this problem, this thesis proposes a new method for detecting false opinions, called PU-Learning* (modified PU-Learning), which increases precision by means of an iterative algorithm and also addresses the lack of labeled opinions. The method needs only a small set of opinions labeled as positive (that is, false opinions) and a large set of unlabeled opinions. From the latter set, the missing negative opinions are extracted and used to obtain a two-class (binary) classification. This scenario has become very common in the available corpora. As a second contribution, we propose a representation based on character n-grams. This representation has the advantage of capturing both content and writing style, which improves the effectiveness of the proposed method for detecting false opinions. The method was evaluated experimentally through three opinion classification experiments on two different collections. The results of each experiment show the effectiveness of the proposed method as well as the differences between the use of several types of attributes. Because the veracity or falsity of user reviews is a very important factor in decision making, the method presented here can be used on any corpus that has the above characteristics.
Hernández Fusilier, D. (2016). Detección de opinion spam usando PU-learning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/61990

    Detection of opinion spam with character n-grams

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-18117-2_21
In this paper we consider the detection of opinion spam as a stylistic classification task because, given a particular domain, deceptive and truthful opinions are similar in content but differ in the way they are written (style). In particular, we propose using character n-grams as features, since they have been shown to capture lexical content as well as stylistic information. We evaluated our approach on a standard corpus of 1600 hotel reviews, considering positive and negative reviews. We compared the results obtained with character n-grams against those obtained with word n-grams. Moreover, we evaluated the effectiveness of character n-grams while decreasing the training set size in order to simulate real training conditions. The results show that character n-grams are good features for the detection of opinion spam; they seem to capture the content of deceptive opinions and the writing style of the deceiver better than word n-grams do. In particular, results show an improvement of 2.3% and 2.1% over the word-based representations in the detection of positive and negative deceptive opinions, respectively. Furthermore, character n-grams achieve good performance even with a very small training corpus. Using only 25% of the training set, a Naïve Bayes classifier showed F1 values up to 0.80 for both opinion polarities.
This work is the result of the collaboration in the framework of the WIQEI IRSES project (Grant No. 269180) within the FP7 Marie Curie. The second author was partially supported by the LACCIR programme under project ID R1212LAC006. The work of the third author was in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
Hernández Fusilier, D.; Montes Gomez, M.; Rosso, P.; Guzmán Cabrera, R. (2015). Detection of opinion spam with character n-grams. In Computational Linguistics and Intelligent Text Processing: 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part II. Springer International Publishing. 285-294. https://doi.org/10.1007/978-3-319-18117-2_21
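As a rough illustration of the two representations compared in the abstract above, the following sketch counts character n-grams and word n-grams for a single review. The function names, the default n = 3, and the whitespace tokenizer are assumptions made for this example only; the paper's actual preprocessing and n-gram sizes are not stated in the abstract.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, keeping spaces and punctuation.

    Case and punctuation are left untouched (an assumption of this sketch),
    since they may carry stylistic information.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=1):
    """Count word n-grams using naive whitespace tokenization (an assumption)."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    review = "The room was great, really great!"
    # Character n-grams span word boundaries and punctuation.
    print(char_ngrams(review, 3).most_common(5))
    # Word n-grams see only whole tokens.
    print(word_ngrams(review, 1).most_common(5))
```

Either Counter can be fed to a bag-of-features classifier; the character version produces features such as "t, " or "at!" that a word-level view cannot express.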

    Detecting Positive and Negative Deceptive Opinions using PU-learning

    [EN] Nowadays a large number of opinion reviews are posted on the Web. Such reviews are a very important source of information for customers and companies. The former rely more than ever on online reviews to make their purchase decisions, and the latter to respond promptly to their clients' expectations. Unfortunately, because of the business interests involved, there is an increasing number of deceptive opinions, that is, fictitious opinions that have been deliberately written to sound authentic in order to deceive consumers, either by promoting a low-quality product (positive deceptive opinions) or by criticizing a potentially good-quality one (negative deceptive opinions). In this paper we focus on the detection of both types of deceptive opinions, positive and negative. Due to the scarcity of examples of deceptive opinions, we propose to approach the detection of deceptive opinions with PU-learning. PU-learning is a semi-supervised technique for building a binary classifier on the basis of positive (i.e., deceptive opinions) and unlabeled examples only. Concretely, we propose a novel method that, compared with the original version, is much more conservative when selecting the negative examples (i.e., non-deceptive opinions) from the unlabeled ones. The results show that the proposed PU-learning method consistently outperformed the original PU-learning approach. In particular, results show an average improvement of 8.2% and 1.6% over the original approach in the detection of positive and negative deceptive opinions, respectively. © 2014 Elsevier Ltd. All rights reserved.
This work is the result of the collaboration in the framework of the WIQEI IRSES project (Grant No. 269180) within the FP7 Marie Curie. The work of the third author was in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
Hernández Fusilier, D.; Montes Gómez, M.; Rosso, P.; Guzmán Cabrera, R. (2015). Detecting Positive and Negative Deceptive Opinions using PU-learning. Information Processing and Management. 51(4):433-443. https://doi.org/10.1016/j.ipm.2014.11.001
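The general idea of learning from positive and unlabeled examples can be sketched as an iterative filter: treat every unlabeled item as negative, train a classifier, keep only the unlabeled items still classified as negative, and repeat. The code below is a minimal generic sketch of that loop, not the authors' algorithm or their more conservative selection rule. The toy nearest-centroid classifier (standing in for a real learner), the character-trigram features, and all function names are assumptions made for illustration.

```python
from collections import Counter
import math


def featurize(text, n=3):
    """Character n-gram counts (hypothetical feature choice for this sketch)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def centroid(vectors):
    """Sum of feature vectors; direction is all that matters for cosine."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0


def pu_learning(positives, unlabeled, max_iter=10):
    """Iteratively shrink the unlabeled set to a set of reliable negatives."""
    pos_c = centroid(featurize(t) for t in positives)
    negatives = list(unlabeled)  # start by treating all unlabeled items as negative
    for _ in range(max_iter):
        neg_c = centroid(featurize(t) for t in negatives)
        # Keep only items that look more like the negatives than the positives.
        kept = [t for t in negatives
                if cosine(featurize(t), neg_c) > cosine(featurize(t), pos_c)]
        if not kept or len(kept) == len(negatives):
            break  # converged, or filtering would leave no negatives
        negatives = kept
    return negatives


if __name__ == "__main__":
    deceptive = ["aaa aaa", "aaaa"]            # toy labeled positives
    pool = ["aaa a", "zzz zzz", "zzzz"]        # toy unlabeled pool
    print(pu_learning(deceptive, pool))
```

The returned reliable negatives, together with the labeled positives, would then be used to train the final two-class classifier.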

    On the Use of PU Learning for Quality Flaw Prediction in Wikipedia

    [EN] In this article we describe a new approach to assess Quality Flaw Prediction in Wikipedia. The partially supervised method studied, called PU Learning, has been successfully applied in classification tasks with traditional corpora like Reuters-21578 or 20-Newsgroups. To the best of our knowledge, this is the first time that it is applied in this domain. Throughout this paper, we describe how the original PU Learning approach was evaluated for assessing quality flaws and the modifications introduced to get a quality flaws predictor which obtained the best F1 scores in the task "Quality Flaw Prediction in Wikipedia" of the PAN challenge.
Edgardo Ferretti and Marcelo Errecalde thank Universidad Nacional de San Luis (PROICO 30310). The collaboration of UNSL, INAOE and UPV has been funded by the European Commission as part of the WIQ-EI project (project no. 269180) within the FP7 People Programme. Manuel Montes is partially supported by CONACYT, No. 134186. The work of Paolo Rosso was carried out also in the framework of the MICINN Text-Enterprise (TIN2009-13391-C04-03) research project and the Microcluster VLC/Campus (International Campus of Excellence) on Multimodal Intelligent Systems.
Ferretti, E.; Hernández Fusilier, D.; Guzmán Cabrera, R.; Montes Y Gómez, M.; Errecalde, M.; Rosso, P. (2012). On the Use of PU Learning for Quality Flaw Prediction in Wikipedia. CEUR Workshop Proceedings. 1178. http://hdl.handle.net/10251/46566

    An OOP Approach to Simplify MDI Application Development

    The Multiple Document Interface (MDI) is a Microsoft Windows specification that allows managing multiple documents using a single graphical interface application. An MDI application allows opening several documents simultaneously, but only one document is active at any given time. MDI applications can be developed using Win32 or the Microsoft Foundation Classes (MFC). Programs developed using Win32 are faster than those using MFC. However, Win32 applications are difficult to implement and prone to errors. Learning how to properly use MFC to develop MDI applications is not simple either, and performance is typically worse than that of Win32 applications. A method to simplify the development of MDI applications using Object-Oriented Programming (OOP) is proposed. It is shown that this method generates compact code that is easier to read and maintain than code produced by other methods (e.g., MFC). Finally, it is demonstrated that the proposed method allows the rapid development of MDI applications without sacrificing application performance.

    Detection of Opinion Spam with Character n-grams

    In this paper we consider the detection of opinion spam as a stylistic classification task because, given a particular domain, deceptive and truthful opinions are similar in content but differ in the way they are written (style). In particular, we propose using character n-grams as features, since they have been shown to capture lexical content as well as stylistic information. We evaluated our approach on a standard corpus of 1600 hotel reviews, considering positive and negative reviews. We compared the results obtained with character n-grams against those obtained with word n-grams. Moreover, we evaluated the effectiveness of character n-grams while decreasing the training set size in order to simulate real training conditions. The results show that character n-grams are good features for the detection of opinion spam; they seem to capture the content of deceptive opinions and the writing style of the deceiver better than word n-grams do. In particular, results show an improvement of 2.3% and 2.1% over the word-based representations in the detection of positive and negative deceptive opinions, respectively. Furthermore, character n-grams achieve good performance even with a very small training corpus. Using only 25% of the training set, a Naïve Bayes classifier showed F1 values up to 0.80 for both opinion polarities.

    A Method to Ease the Deployment of Web Applications that Involve Database Systems

    The continuous growth of the Internet has driven people all around the globe to perform transactions online, search for information, or navigate using a browser. As more people feel comfortable using a Web browser, more software companies are trying to offer Web interfaces as an alternative way to access their applications. The nature of the Web connection and the restrictions imposed by the available bandwidth make the successful integration of Web applications and database systems critical. Because popular database applications provide a user interface to edit and maintain the information in the database, and because each column in a database table maps to a graphical user interface control, the deployment of these applications can be time consuming: appropriate field validation and referential integrity rules must be observed. Thus, an object-oriented approach is proposed to ease the development of applications that involve database systems.

    Multiple criteria fake reviews detection using belief function theory

    Checking online reviews before making a purchase has become a regular habit. Hence, online consumer reviews of products and services play an increasingly important role in consumer purchasing decisions. Unfortunately, the importance of advertising and the lure of profit have led to the appearance of fake reviews intended to mislead readers. Considering that reviews are generally imperfect, spam review detection has become one of the most important problems. To tackle this problem, we propose a new multi-criteria method for fake review detection under belief function theory. This approach handles the uncertainty in the ratings that reviewers give to multiple evaluation criteria, takes into account the similarity between all provided reviews, and deals with missing data. We evaluate our method on artificial datasets and then use a real dataset to validate it. The results show that the proposed approach is a useful solution to the fake review detection problem.